Feat/binning transformation #801

Open
raivo-otus wants to merge 13 commits into microbiome:devel from raivo-otus:feat/binning_transformation

@raivo-otus
Contributor

@raivo-otus raivo-otus commented Jan 15, 2026

Adds a quantile-based binning transformation to the mia::transformAssay() function, as discussed in #800.
The default value, bin = 4, roughly corresponds to a division into "rare, low, medium, high", which is easy to understand.
Unit tests include checks for both sample-wise and feature-wise transforms.

Should the transformation default to using the "relabundance" assay, or should the choice be left to the user's discretion?

Pending tasks:

  • Add documentation
  • Add information on binning transformation to OMA

Potential optimizations:

  • Adding a parallel version, with e.g. doFuture, can be beneficial for large datasets.
  • C++ implementation called with Rcpp would be even better, both for serial and parallel performance.

@raivo-otus raivo-otus marked this pull request as draft January 15, 2026 14:05
Contributor

@TuomasBorman TuomasBorman left a comment

Looks good, a couple of comments.

@TuomasBorman
Contributor

> Should the transformation default to using "relabundance" assay, or leave choice to user discretion?

As the binning is based on ranks, shouldn't counts and relabundance lead to the same result?
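A quick way to check this on a toy sample is sketched below. Within a single sample, relative abundance is the counts divided by a positive constant, so the ordering and the quantile bins should not change. The helper `bin_quantile()` is hypothetical and uses only base R; it is not the implementation in this PR. The argument applies per sample: for feature-wise binning across samples with different library sizes, counts and relative abundances need not agree.

```r
# Hypothetical helper (not the mia implementation): assign each value to one
# of `bins` quantile-based bins, numbered 1..bins, using base R only.
bin_quantile <- function(x, bins = 4) {
    probs <- seq(0, 1, length.out = bins + 1)
    breaks <- quantile(x, probs = probs, names = FALSE)
    # Keep only the interior cut points; intervals are (lower, upper].
    inner <- breaks[-c(1, length(breaks))]
    findInterval(x, inner, left.open = TRUE) + 1L
}

counts <- c(0, 3, 12, 40, 7, 150, 1, 25)   # toy counts for one sample
relab  <- counts / sum(counts)             # relative abundance of the same sample

bins_counts <- bin_quantile(counts)
bins_relab  <- bin_quantile(relab)
same_bins   <- identical(bins_counts, bins_relab)
```

Here `same_bins` is expected to be TRUE, because dividing by the sample total is a monotone rescaling.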

@TuomasBorman
Contributor

Also discussed over lunch; you could update OMA's ML chapter if binning improves the accuracy

@raivo-otus
Contributor Author

> > Should the transformation default to using "relabundance" assay, or leave choice to user discretion?
>
> As the binning is based on ranks, shouldn't counts and relabundance lead to same result?

My concern is with using e.g. CLR-transformed values for the binning, which leads to unexpected bins. In that sense, adding some sort of check that the input is relabundance or counts feels like a safety net.

@TuomasBorman
Contributor

Ahh, yes. Maybe you could check that the values are positive and give an error if not, as the result does not make any sense otherwise.
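Such a check could look roughly like the sketch below. The function name `check_nonnegative()` and the error message are hypothetical, not what the PR implements. One assumption to note: the sketch tests for non-negative rather than strictly positive values, because count tables typically contain zeros.

```r
# Hypothetical input check (not the mia implementation): stop early when the
# assay holds negative values, e.g. after a CLR transformation.
check_nonnegative <- function(mat) {
    if (any(mat < 0, na.rm = TRUE)) {
        stop("Binning expects non-negative values such as counts or ",
             "relative abundances; found negative values.", call. = FALSE)
    }
    invisible(TRUE)
}

# Toy assay: 3 taxa (rows) x 2 samples (columns).
counts <- matrix(c(0, 5, 20, 3, 0, 8), nrow = 3)

# A CLR-like transform with a pseudocount centers each sample around zero,
# so some values are negative whenever the sample is not constant.
clr_like <- apply(counts + 1, 2, function(x) log(x) - mean(log(x)))

ok  <- check_nonnegative(counts)   # passes silently
msg <- tryCatch(check_nonnegative(clr_like),
                error = function(e) conditionMessage(e))
```

Raising an error rather than a warning follows the suggestion above, since bins computed from centered log-ratios would not carry the intended "rare to high" meaning.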

@raivo-otus raivo-otus marked this pull request as ready for review January 22, 2026 10:05
@antagomir
Member

The standard binning in R is done with function "cut". This is widely used and has many useful arguments. Just thinking whether that should be supported, or in general should multiple binning options be supported and their difference explained in the documentation.

But that can be another PR.

@raivo-otus
Contributor Author

> The standard binning in R is done with function "cut". This is widely used and has many useful arguments. Just thinking whether that should be supported, or in general should multiple binning options be supported and their difference explained in the documentation.
>
> But that can be another PR.

I'll look into using cut. It should be possible to implement quantile-based binning with cut as well. Using built-in functions where possible is most likely faster.
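A minimal sketch of quantile-based binning built on cut() and quantile() follows. The helper `bin_with_cut()` is hypothetical and only illustrates the idea; it is not the code in this PR. The unique() call is an assumption added here because cut() rejects duplicated breaks, which can occur when many values are tied (for example, many zeros).

```r
# Hypothetical helper, not the mia implementation.
bin_with_cut <- function(x, bins = 4) {
    probs <- seq(0, 1, length.out = bins + 1)
    # Drop duplicated quantiles so that cut() receives strictly increasing breaks.
    breaks <- unique(quantile(x, probs = probs, names = FALSE, na.rm = TRUE))
    if (length(breaks) < 2) {
        # All values identical: put everything in a single bin.
        return(rep(1L, length(x)))
    }
    # labels = FALSE returns integer bin codes instead of a factor.
    cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
}

set.seed(1)
mat <- matrix(rpois(40, lambda = 10), nrow = 8,
              dimnames = list(paste0("taxon", 1:8), paste0("sample", 1:5)))

# Sample-wise: bin the taxa within each column.
binned_samples <- apply(mat, 2, bin_with_cut)
# Feature-wise: bin each row across samples; apply() returns the transpose,
# so t() restores the taxa-by-samples layout.
binned_features <- t(apply(mat, 1, bin_with_cut))
```

When ties collapse some quantiles, this sketch yields fewer than `bins` distinct bins; whether that is the desired behaviour is a design choice left open here.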

I think testing different binning methods would be beneficial. The quantile binning approach is supported by the BiomeGPT paper, but there are of course other ways, such as simple equal-width bins. It seems logical to add other binning options in a separate PR if they prove useful.

@TuomasBorman
Contributor

@raivo-otus any updates?

@raivo-otus
Contributor Author

Added a literature reference to the BiomeGPT paper describing the binning strategy, and changed the implementation to use cut().

3 participants